This is a self-contained data analysis report.
The main hypothesis of this study were evaluated in consideration of daytime pumping station and artificial habitat use, which were
Roach will occupy pumping station when artificial habitat is absent
When introduced (pre-exclusion), roach will show equal preference towards artificial habitat and pumping station
When excluded from the pumping station, artificial habitat occupancy will increase
Roach will show a preferential change (compared to pre-exclusion) to occupying artificial habitat when the pumping station is reintroduced (post-exclusion)
Artificial habitat occupancy will be higher in sheltered (vs unsheltered) treatments
library(tidyverse)
library(knitr)
library(kableExtra)
library(funModeling)
library(dlookr)
library(glmmTMB)
library(ggeffects)
library(DHARMa)
library(ggpubr)
roach_wide.csv - wide data set where fish counts are stored in columns c_hab, c_open, c_ps
roach_wide=read_csv("./data/roach_wide.csv")
Determine dataset structure by checking number of rows, columns and data types (i.e., numerical, factors). Identify categorical factors and their levels.
nrow(roach_wide)
## [1] 6942
ncol(roach_wide)
## [1] 24
colnames(roach_wide)
## [1] "date" "time" "run" "trial"
## [5] "treatment" "ps_tank_end" "room_end" "tank"
## [9] "light" "sequence" "day" "seq_day"
## [13] "hab_avail" "ps_avail" "both_avail" "hours_havail"
## [17] "hours_lout" "c_hab" "c_open_screen" "c_open_cent"
## [21] "c_open_hab" "c_open" "c_ps" "c_both"
funModeling:df_status - For each variable it returns: Quantity and percentage of zeros (q_zeros and p_zeros respectively). Same metrics for NA values (q_NA/p_na), and infinite values (q_inf/p_inf). Last two columns indicates data type and quantity of unique values.
roach_wide_status<- funModeling::df_status(roach_wide, print_results = F)
| variable | q_zeros | p_zeros | q_na | p_na | q_inf | p_inf | type | unique |
|---|---|---|---|---|---|---|---|---|
| date | 0 | 0.00 | 0 | 0.00 | 0 | 0 | character | 52 |
| time | 288 | 4.15 | 0 | 0.00 | 0 | 0 | hms-difftime | 24 |
| run | 0 | 0.00 | 0 | 0.00 | 0 | 0 | numeric | 4 |
| trial | 0 | 0.00 | 0 | 0.00 | 0 | 0 | numeric | 24 |
| treatment | 3471 | 50.00 | 0 | 0.00 | 0 | 0 | numeric | 2 |
| ps_tank_end | 3468 | 49.96 | 0 | 0.00 | 0 | 0 | numeric | 2 |
| room_end | 3471 | 50.00 | 0 | 0.00 | 0 | 0 | numeric | 2 |
| tank | 0 | 0.00 | 0 | 0.00 | 0 | 0 | numeric | 6 |
| light | 2616 | 37.68 | 0 | 0.00 | 0 | 0 | numeric | 2 |
| sequence | 0 | 0.00 | 0 | 0.00 | 0 | 0 | numeric | 4 |
| day | 0 | 0.00 | 0 | 0.00 | 0 | 0 | numeric | 13 |
| seq_day | 0 | 0.00 | 0 | 0.00 | 0 | 0 | numeric | 52 |
| hab_avail | 5214 | 75.11 | 0 | 0.00 | 0 | 0 | numeric | 2 |
| ps_avail | 5166 | 74.42 | 0 | 0.00 | 0 | 0 | numeric | 2 |
| both_avail | 3504 | 50.48 | 0 | 0.00 | 0 | 0 | numeric | 2 |
| hours_havail | 0 | 0.00 | 3516 | 50.65 | 0 | 0 | numeric | 72 |
| hours_lout | 288 | 4.15 | 2334 | 33.62 | 0 | 0 | numeric | 16 |
| c_hab | 3201 | 46.11 | 0 | 0.00 | 0 | 0 | numeric | 13 |
| c_open_screen | 5683 | 81.86 | 0 | 0.00 | 0 | 0 | numeric | 13 |
| c_open_cent | 5331 | 76.79 | 0 | 0.00 | 0 | 0 | numeric | 13 |
| c_open_hab | 4510 | 64.97 | 0 | 0.00 | 0 | 0 | numeric | 13 |
| c_open | 3732 | 53.76 | 0 | 0.00 | 0 | 0 | numeric | 13 |
| c_ps | 3724 | 53.64 | 0 | 0.00 | 0 | 0 | numeric | 13 |
| c_both | 599 | 8.63 | 0 | 0.00 | 0 | 0 | numeric | 13 |
Several variables need to be converted to factors and labels added to levels.
Create a lookup table for variable labels.
labels_table <- list(
treatment = c('Covered (B)','Uncovered (A)'),
ps_tank_end = c('RH', 'LH'),
room_end = c('Far', 'Near'),
light = c('Day', 'Night'),
hab_avail = c('AH unavailable', 'AH available'),
ps_avail = c('PS available','PS unavailable'),
both_avail = c('Single Available', 'Both Available'),
sequence = c('Baseline', 'I 1', 'I 2', 'I 3')
)
Convert variables to factors with specific labeling using a loop. Use the lookup table to get the labels for each variable.
for (var in names(labels_table)) {
levels <- unique(roach_wide[[var]])
labels <- labels_table[[var]]
roach_wide[[var]] <- factor(roach_wide[[var]], levels = levels, labels = labels)
}
Convert other variables to factors without adding labels.
other_vars <- c("hours_havail", "hours_lout", "run", "day", "trial")
roach_wide[other_vars] <- lapply(roach_wide[other_vars], as.factor)
Using the roach_wide_status dataframe consider variables with NA values.
roach_na <- roach_wide_status %>%
filter(q_na > 0) %>%
arrange(-p_na) %>%
select(variable, q_na, p_na)
| variable | q_na | p_na |
|---|---|---|
| hours_havail | 3516 | 50.65 |
| hours_lout | 2334 | 33.62 |
Variables with NA not considered in main analysis, NAs can be omitted for visual analysis.
Using the roach_wide_status dataframe consider the spread of zeros in habitat count variables
roach_0 <- roach_wide_status %>%
filter(variable %in% c("c_hab", "c_ps", "c_open", "c_open_cent", "c_open_screen", "c_open_hab")) %>%
arrange(-p_zeros) %>%
select(variable, q_zeros, p_zeros)
| variable | q_zeros | p_zeros |
|---|---|---|
| c_open_screen | 5683 | 81.86 |
| c_open_cent | 5331 | 76.79 |
| c_open_hab | 4510 | 64.97 |
| c_open | 3732 | 53.76 |
| c_ps | 3724 | 53.64 |
| c_hab | 3201 | 46.11 |
High proportion of 0’s in all counts, but will be confounded without consideration for light period and habitat availability. Regardless, the presence of high 0’s will significantly skew variability and make raw counts hard to interpret. Descriptive statistics would be limited by high IQR and misleading min/max values. Consider rescaling count variables for analysis.
Check the main dataframe for outliers.
dlookr::plot_outlier - for each variable specified the function plots outlier information for numerical data diagnosis
dlookr::plot_outlier(roach_wide, "c_hab", "c_ps", "c_open")
Check the main dataframe for data distribution.
dlookr::plot_normality - for each variable specified the function determines normality by plotting histogram and q-q plot of the original data, log transformed, and square root transformed.
dlookr::plot_normality(roach_wide, "c_hab", "c_ps", "c_open")
The code performs a two-step transformation: it first normalizes raw fish count data to a 0-1 scale, and then ensures that the normalized values have a small positive offset to prevent division by zero issues for modelling.
roach_wide <- roach_wide %>%
mutate(across(c(c_hab, c_ps, c_open), ~ . / max(.), .names = "{.col}_normalized"),
across(ends_with("_normalized"), ~ ifelse(. == 0, . + 0.0000000001, .)))
First, create minimal dataset with variables of interest.
roach_wide_sum = select(roach_wide, c_hab, c_ps, c_open, c_hab_normalized, c_ps_normalized, c_open_normalized, sequence, light, treatment)
funModeling::profile_num - Provides an expanded summary including mean, standard deviation, skewness and kurtosis.
Skewness >0 = right-skew, <0 = left-skew, 0 = symmetric. kurtosis >3 = sharp, <3 = flat, 3 = normal
numeric_prof <-funModeling::profiling_num(roach_wide_sum)
| variable | mean | std_dev | variation_coef | p_01 | p_05 | p_25 | p_50 | p_75 | p_95 | p_99 | skewness | kurtosis | iqr | range_98 | range_80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c_hab | 4.1780467 | 4.7238987 | 1.130648 | 0 | 0 | 0 | 2.0000000 | 8.0000000 | 12 | 12 | 0.6180048 | 1.782165 | 8.0000000 | [0, 12] | [0, 12] |
| c_ps | 4.3388073 | 5.4305629 | 1.251626 | 0 | 0 | 0 | 0.0000000 | 12.0000000 | 12 | 12 | 0.6121211 | 1.476341 | 12.0000000 | [0, 12] | [0, 12] |
| c_open | 3.4462691 | 4.3204959 | 1.253673 | 0 | 0 | 0 | 0.0000000 | 7.0000000 | 12 | 12 | 0.8004510 | 2.112379 | 7.0000000 | [0, 12] | [0, 11] |
| c_hab_normalized | 0.3481706 | 0.3936582 | 1.130648 | 0 | 0 | 0 | 0.1666667 | 0.6666667 | 1 | 1 | 0.6180048 | 1.782165 | 0.6666667 | [1e-10, 1] | [1e-10, 1] |
| c_ps_normalized | 0.3615673 | 0.4525469 | 1.251626 | 0 | 0 | 0 | 0.0000000 | 1.0000000 | 1 | 1 | 0.6121211 | 1.476341 | 1.0000000 | [1e-10, 1] | [1e-10, 1] |
| c_open_normalized | 0.2871891 | 0.3600413 | 1.253673 | 0 | 0 | 0 | 0.0000000 | 0.5833333 | 1 | 1 | 0.8004510 | 2.112379 | 0.5833333 | [1e-10, 1] | [1e-10, 0.916666666666667] |
Standard deviations of raw counts are high, as expected from normality plots.
Examine raw count descriptive summary before considering rescaled values.
count_sum <- bind_rows(
roach_wide_sum %>%
group_by(light) %>%
summarise(sum_c_hab = sum(c_hab),
sum_c_ps = sum(c_ps),
sum_c_open = sum(c_open)) %>%
rename(variable = light),
roach_wide_sum %>%
group_by(sequence) %>%
summarise(sum_c_hab = sum(c_hab),
sum_c_ps = sum(c_ps),
sum_c_open = sum(c_open)) %>%
rename(variable = sequence),
roach_wide_sum %>%
summarise(sum_c_hab = sum(c_hab),
sum_c_ps = sum(c_ps),
sum_c_open = sum(c_open),
variable = "Total") %>%
mutate(variable = factor(variable)))
| variable | sum_c_hab | sum_c_ps | sum_c_open |
|---|---|---|---|
| Light period | |||
| Day | 15011 | 13426 | 2790 |
| Night | 13993 | 16694 | 21134 |
| Experimental sequence | |||
| Baseline | 0 | 15014 | 6283 |
| I 1 | 4483 | 13129 | 3041 |
| I 2 | 13272 | 0 | 7327 |
| I 3 | 11249 | 1977 | 7273 |
| Total count | |||
| Total | 29004 | 30120 | 23924 |
Examine rescaled counts
rescaled_mean <- bind_rows(
roach_wide_sum %>%
group_by(light) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
mean_c_ps = mean(c_ps_normalized),
mean_c_open = mean(c_open_normalized)) %>%
rename(variable = light),
roach_wide_sum %>%
group_by(sequence) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
mean_c_ps = mean(c_ps_normalized),
mean_c_open = mean(c_open_normalized)) %>%
rename(variable = sequence),
roach_wide_sum %>% filter(light=="Day")%>%
group_by(treatment) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
mean_c_ps = mean(c_ps_normalized),
mean_c_open = mean(c_open_normalized)) %>%
rename(variable = treatment))
| variable | mean_c_hab | mean_c_ps | mean_c_open |
|---|---|---|---|
| Light period | |||
| Day | 0.4781792 | 0.4276886 | 0.0888761 |
| Night | 0.2695523 | 0.3215827 | 0.4071120 |
| Experimental sequence | |||
| Baseline | 0.0000000 | 0.7044857 | 0.2948104 |
| I 1 | 0.2161941 | 0.6331501 | 0.1466532 |
| I 2 | 0.6400463 | 0.0000000 | 0.3533468 |
| I 3 | 0.5481969 | 0.0963450 | 0.3544347 |
| Treatment (daytime) | |||
| Covered (B) | 0.4968782 | 0.4083206 | 0.0899592 |
| Uncovered (A) | 0.4594801 | 0.4470566 | 0.0877931 |
Apply custom themeing for ggplot2 objects.
theme_JN <- function(base_size=12){
theme_grey() %+replace%
theme(
axis.text = element_text(colour="black"),
axis.title = element_text(colour="black"),
axis.ticks = element_line(colour="black"),
panel.border = element_rect(colour = "black", fill=NA),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
strip.background = element_rect(colour = "black",fill = NA),
panel.spacing.x = unit(12, "pt")
)
}
Plot raw count data first to confirm suitability of rescaling data.
ggplot(roach_wide_sum, aes(x = sequence, y= c_ps)) +
geom_boxplot() +theme_JN()+theme(text=element_text(size=20))
ggplot(roach_wide_sum, aes(x = sequence, y= c_hab)) +
geom_boxplot()+theme_JN()+theme(text=element_text(size=20))
ggplot(roach_wide_sum, aes(x = sequence, y= c_open)) +
geom_boxplot()+theme_JN()+theme(text=element_text(size=20))
The boxplots confirm raw count data are unsuitable for analysis due to large variation in counts presenting large IQR + outliers. Descriptive data e.g., medians are therefore hard to interpret and generalise.
From here on all analysis considers rescaled counts.
#Artificial habitat,light
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>%
group_by(light) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
se_c_hab = sd(c_hab_normalized) / sqrt(n())),
aes(x = light, y = mean_c_hab)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#pumping station, light
ggplot(roach_wide_sum %>% filter(sequence!="I 2")%>%
group_by(light) %>%
summarise(mean_c_hab = mean(c_ps_normalized),
se_c_hab = sd(c_ps_normalized) / sqrt(n())),
aes(x = light, y = mean_c_hab)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#open water, light
ggplot(roach_wide_sum %>%
group_by(light) %>%
summarise(mean_c_hab = mean(c_open_normalized),
se_c_hab = sd(c_open_normalized) / sqrt(n())),
aes(x = light, y = mean_c_hab)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#Artificial habitat, sequence, light
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>%
group_by(sequence, light) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
se_c_hab = sd(c_hab_normalized) / sqrt(n())),
aes(x = sequence, y = mean_c_hab, colour = light)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#pumping station, sequence, light
ggplot(roach_wide_sum %>% filter(sequence!="I 2")%>%
group_by(sequence, light) %>%
summarise(mean_c_hab = mean(c_ps_normalized),
se_c_hab = sd(c_ps_normalized) / sqrt(n())),
aes(x = sequence, y = mean_c_hab, colour = light)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#open water, sequence, light
ggplot(roach_wide_sum %>%
group_by(sequence, light) %>%
summarise(mean_c_hab = mean(c_open_normalized),
se_c_hab = sd(c_open_normalized) / sqrt(n())),
aes(x = sequence, y = mean_c_hab, colour = light)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
By plotting the main visual relationships we’re able to quickly identify relationships in the data and begin addressing study objectives and hypothesis.
Here, we can consider that the day/night relationship in habitat occupancy was relatively fixed throughout the experiment in all habitat options. We see that open water occupancy was decreased in intervention 1, possibly attributed to introducing artificial habitat.
To consider H5, let’s examine treatment effects.
#Artificial habitat, sequence, treatment
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>%
group_by(sequence, treatment) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
se_c_hab = sd(c_hab_normalized) / sqrt(n())),
aes(x = sequence, y = mean_c_hab, colour = treatment)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#pumping station, sequence, treatment
ggplot(roach_wide_sum %>% filter(sequence!="I 2")%>%
group_by(sequence, treatment) %>%
summarise(mean_c_hab = mean(c_ps_normalized),
se_c_hab = sd(c_ps_normalized) / sqrt(n())),
aes(x = sequence, y = mean_c_hab, colour = treatment)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#open water, sequence, treatment
ggplot(roach_wide_sum %>%
group_by(sequence, treatment) %>%
summarise(mean_c_hab = mean(c_open_normalized),
se_c_hab = sd(c_open_normalized) / sqrt(n())),
aes(x = sequence, y = mean_c_hab, colour = treatment)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#Artificial habitat, treatment, light
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>%
group_by(treatment, light) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
se_c_hab = sd(c_hab_normalized) / sqrt(n())),
aes(x = treatment, y = mean_c_hab, colour = light)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#daytime counts highest in covered treatments
#pumping station, treatment, light
ggplot(roach_wide_sum %>% filter(sequence!="I 2")%>%
group_by(treatment, light) %>%
summarise(mean_c_hab = mean(c_ps_normalized),
se_c_hab = sd(c_ps_normalized) / sqrt(n())),
aes(x = treatment, y = mean_c_hab, colour = light)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#open water, treatment, light
ggplot(roach_wide_sum %>%
group_by(treatment, light) %>%
summarise(mean_c_hab = mean(c_open_normalized),
se_c_hab = sd(c_open_normalized) / sqrt(n())),
aes(x = treatment, y = mean_c_hab, colour = light)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#Artificial habitat, treatment (daytime)
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>% filter(light=="Day")%>%
group_by(treatment, light) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
se_c_hab = sd(c_hab_normalized) / sqrt(n())),
aes(x = treatment, y = mean_c_hab, colour = light)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
#Artificial habitat, sequence, treatment (daytime)
ggplot(roach_wide_sum %>% filter(sequence!="Baseline")%>% filter(light=="Day")%>%
group_by(sequence, treatment) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
se_c_hab = sd(c_hab_normalized) / sqrt(n())),
aes(x = sequence, y = mean_c_hab, colour = treatment)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
In these plots we see that treatment has an impact on habitat occupancy and in all sequences, habitat occupancy in artificial habitat was highest in uncovered treatments.Interestingly, occupancy in pumping station shows the opposite relationship to artificial habitat. Pumping station occupancy was highest in uncovered treatments, suggesting that cover is important i.e., when uncovered habitat is provided, more fish occupy the pumping station. Overall, we found support for H5.
Before continuing with statistical analysis let’s examine repeated measures and temporal effects.
#Artificial habitat, trial (daytime)
ggplot(roach_wide %>% filter(sequence!="Baseline")%>% filter(light=="Day")%>%
group_by(trial) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
se_c_hab = sd(c_hab_normalized) / sqrt(n())),
aes(x = trial, y = mean_c_hab)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
The variation between trials suggests significant effect of within-subject observations (i.e., repeated measures). Additionally, there appears to be temporal influence (ex. trial 1 compared to trial 24)
ggplot(roach_wide %>% filter(sequence!="Baseline")%>%
group_by(time) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
se_c_hab = sd(c_hab_normalized) / sqrt(n())),
aes(x = time, y = mean_c_hab)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
Evidence for hour-to-hour variation. Ensure is included in analysis.
Finally, considering further relationships not required for main objectives.
ggplot(roach_wide %>%
filter(!is.na(hours_havail))%>% filter(sequence =="I 1")%>%
group_by(hours_havail) %>%
summarise(mean_c_hab = mean(c_hab_normalized),
se_c_hab = sd(c_hab_normalized) / sqrt(n())),
aes(x = hours_havail, y = mean_c_hab)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
roach_wide %>%
filter(!is.na(hours_havail)) %>%
filter(sequence == "I 1") %>%
mutate(hours = as.numeric(hours_havail)) %>%
with(cor.test(hours, c_hab_normalized))
##
## Pearson's product-moment correlation
##
## data: hours and c_hab_normalized
## t = 4.8211, df = 1702, p-value = 1.555e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06896398 0.16266011
## sample estimates:
## cor
## 0.1160703
There is no real relationship. Occupancy peaks 24h after introduction, as expected due to nocturnal behavior. Correlation testing shows a weak positive correlation, but significant.
ggplot(roach_wide %>%
filter(!is.na(hours_lout))%>%
group_by(hours_lout) %>%
summarise(mean_c_hab = mean(c_open_normalized),
se_c_hab = sd(c_open_normalized) / sqrt(n())),
aes(x = hours_lout, y = mean_c_hab)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_c_hab - se_c_hab, ymax = mean_c_hab + se_c_hab),
width = 0.2)+theme_JN()+theme(text=element_text(size=20))
roach_wide %>%
filter(!is.na(hours_lout)) %>%
mutate(hours = as.numeric(hours_lout)) %>%
with(cor.test(hours, c_open_normalized))
##
## Pearson's product-moment correlation
##
## data: hours and c_open_normalized
## t = 9.4592, df = 4606, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1096056 0.1662548
## sample estimates:
## cor
## 0.1380431
Again, no strong relationship. Occupancy peaks within 1h of lights out. Correlation testing shows a weak positive correlation, but significant.
Ordinarily we may now choose to perform group comparisons etc on our variables of interest to support the relationships shown in the visual analysis. However, the nature of the problem is complex and thus is best treated by a modelling approach.
Data exploration has shown:
table(roach_wide$sequence)
##
## Baseline I 1 I 2 I 3
## 1776 1728 1728 1710
table(roach_wide$treatment)
##
## Covered (B) Uncovered (A)
## 3471 3471
table(roach_wide$light)
##
## Day Night
## 2616 4326
The data is well-balanced across the grouping variables except light, which is expected due to day night differences.
Let’s extract hour from the time variable and store as factor for random effect modelling.
roach_wide <- roach_wide %>%
mutate(time_factor = factor(as.numeric(format(strptime(time, format = "%H:%M:%S"), "%H"))))
Create a new dataframe for binary modelling later.
#Create new DF for binary model
roach_binary <- roach_wide %>%filter (sequence=="I 1"|sequence=="I 3")%>%filter(light=="Day")
roach_binary$binary <- ifelse(roach_binary$sequence =="I 1", 0,1)
Start by creating a null (intercept only) model.
mod1 <- glmmTMB(c_hab_normalized ~ 1, data = roach_wide%>%filter(sequence!="Baseline"),
family = gaussian(link="log"))
summary(mod1)
## Family: gaussian ( log )
## Formula: c_hab_normalized ~ 1
## Data: roach_wide %>% filter(sequence != "Baseline")
##
## AIC BIC logLik deviance df.resid
## 4939.5 4952.6 -2467.8 4935.5 5164
##
##
## Dispersion estimate for gaussian family (sigma^2): 0.152
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7596 0.0116 -65.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Intrepting the null model is not important here, but we can see that the estimated mean of our normalised count variable is significantly different from 0. We will use this null model for determining model fit by comparing subsequent models to the null using AIC selection.
Let’s add our fixed effects of interest.
mod1.1 <- update(mod1, c_hab_normalized ~ sequence + light + treatment)
summary(mod1.1)
## Family: gaussian ( log )
## Formula: c_hab_normalized ~ sequence + light + treatment
## Data: roach_wide %>% filter(sequence != "Baseline")
##
## AIC BIC logLik deviance df.resid
## 2422.0 2461.3 -1205.0 2410.0 5160
##
##
## Dispersion estimate for gaussian family (sigma^2): 0.0934
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.20486 0.03538 -34.05 < 2e-16 ***
## sequenceI 2 1.12453 0.03607 31.18 < 2e-16 ***
## sequenceI 3 1.02292 0.03656 27.98 < 2e-16 ***
## lightNight -0.62328 0.01728 -36.07 < 2e-16 ***
## treatmentUncovered (A) -0.08459 0.01619 -5.22 1.76e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model outputs are sensible and have reduced the AIC of the null model from 4939 to 2422. All our predictor terms are significant with sensible coefficent and error estimates. For example, we can see that the mean value of c_hab_normalized is reduced by -0.623 units during night time light conditions.
It can be useful to plot model predictions early whilst building a complete model to understand how adding independent variables affects the estimated response.
plot(ggpredict(mod1.1, terms = c("sequence","light")))+theme_JN()+theme(text=element_text(size=20))
Here, we can see that the predicted habitat occupancy response is sensible, and close to the observed values we plotted earlier.
Importantly, random effects need to be added to account for repeated measures and temporal dependency.
mod1.2 <- update(mod1.1, . ~ . + (1 | time_factor/day/trial))
summary(mod1.2)
## Family: gaussian ( log )
## Formula:
## c_hab_normalized ~ sequence + light + treatment + (1 | time_factor/day/trial)
## Data: roach_wide %>% filter(sequence != "Baseline")
##
## AIC BIC logLik deviance df.resid
## 2401.6 2460.5 -1191.8 2383.6 5157
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## trial:day:time_factor (Intercept) 1.001e-10 0.00001
## day:time_factor (Intercept) 7.687e-03 0.08767
## time_factor (Intercept) 9.906e-04 0.03147
## Residual 9.123e-02 0.30204
## Number of obs: 5166, groups:
## trial:day:time_factor, 5166; day:time_factor, 216; time_factor, 24
##
## Dispersion estimate for gaussian family (sigma^2): 0.0912
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.19015 0.03873 -30.731 < 2e-16 ***
## sequenceI 2 1.12143 0.03946 28.418 < 2e-16 ***
## sequenceI 3 0.98280 0.04094 24.007 < 2e-16 ***
## lightNight -0.62157 0.02527 -24.600 < 2e-16 ***
## treatmentUncovered (A) -0.08580 0.01598 -5.369 7.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(ggpredict(mod1.2, terms = c("sequence","light")))+theme_JN()+theme(text=element_text(size=20))
As before, the model predictions are improved and the within-subject variation has been handled by adding the random effect of trial.
We can check the goodness of fit by plotting residuals using DHARMA.
fittedmod1.2 <- mod1.2
simuout1 <- simulateResiduals(fittedModel = fittedmod1.2)
plot(simuout1, quantreg = T)
The residuals generally follow a linear relationship and he deviation is accepted within a generalised framework. The quantile deviations present are likely a result of dispersion due to clumped values close to 0 and 1.
The model process is now repeated for the remaining response variables; pumping station and open water.
#Null model
mod2 <- glmmTMB(c_ps_normalized ~ 1, data = roach_wide%>%filter(sequence!="I 2"),
family = gaussian(link="log"))
summary(mod2)
## Family: gaussian ( log )
## Formula: c_ps_normalized ~ 1
## Data: roach_wide %>% filter(sequence != "I 2")
##
## AIC BIC logLik deviance df.resid
## 6784.9 6798.0 -3390.4 6780.9 5212
##
##
## Dispersion estimate for gaussian family (sigma^2): 0.215
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.73106 0.01334 -54.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Add fixed effects
mod2.1 <- update(mod2, c_ps_normalized ~ sequence + light + treatment)
summary(mod2.1)
## Family: gaussian ( log )
## Formula: c_ps_normalized ~ sequence + light + treatment
## Data: roach_wide %>% filter(sequence != "I 2")
##
## AIC BIC logLik deviance df.resid
## 4372.6 4411.9 -2180.3 4360.6 5208
##
##
## Dispersion estimate for gaussian family (sigma^2): 0.135
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.20719 0.01776 -11.665 < 2e-16 ***
## sequenceI 1 -0.10587 0.01847 -5.733 9.88e-09 ***
## sequenceI 3 -2.01448 0.09481 -21.248 < 2e-16 ***
## lightNight -0.29107 0.01827 -15.932 < 2e-16 ***
## treatmentUncovered (A) 0.05181 0.01827 2.836 0.00457 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Add random effects
mod2.2 <- update(mod2.1, . ~ . + (1 | time_factor/day/trial))
summary(mod2.2)
## Family: gaussian ( log )
## Formula:
## c_ps_normalized ~ sequence + light + treatment + (1 | time_factor/day/trial)
## Data: roach_wide %>% filter(sequence != "I 2")
##
## AIC BIC logLik deviance df.resid
## 4218.1 4277.1 -2100.1 4200.1 5205
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## trial:day:time_factor (Intercept) 9.025e-02 3.004e-01
## day:time_factor (Intercept) 2.876e-11 5.363e-06
## time_factor (Intercept) 3.499e-13 5.915e-07
## Residual 1.061e-01 3.257e-01
## Number of obs: 5214, groups:
## trial:day:time_factor, 5214; day:time_factor, 218; time_factor, 24
##
## Dispersion estimate for gaussian family (sigma^2): 0.106
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26812 0.01981 -13.536 < 2e-16 ***
## sequenceI 1 -0.09148 0.01932 -4.736 2.18e-06 ***
## sequenceI 3 -2.05188 0.08633 -23.768 < 2e-16 ***
## lightNight -0.25369 0.01920 -13.212 < 2e-16 ***
## treatmentUncovered (A) 0.04263 0.01915 2.227 0.026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(ggpredict(mod2.2, terms = c("sequence","light")))+theme_JN()+theme(text=element_text(size=20))
#Null model
mod3 <- glmmTMB(c_open_normalized ~ 1, data = roach_wide,
family = gaussian(link="log"))
summary(mod3)
## Family: gaussian ( log )
## Formula: c_open_normalized ~ 1
## Data: roach_wide
##
## AIC BIC logLik deviance df.resid
## 5520.5 5534.2 -2758.3 5516.5 6940
##
##
## Dispersion estimate for gaussian family (sigma^2): 0.13
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.24761 0.01505 -82.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Add fixed effects
mod3.1 <- update(mod3, c_open_normalized ~ sequence + light + treatment)
summary(mod3.1)
## Family: gaussian ( log )
## Formula: c_open_normalized ~ sequence + light + treatment
## Data: roach_wide
##
## AIC BIC logLik deviance df.resid
## 3544.6 3592.6 -1765.3 3530.6 6935
##
##
## Dispersion estimate for gaussian family (sigma^2): 0.0974
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.47386 0.07071 -34.99 < 2e-16 ***
## sequenceI 1 -0.54735 0.04770 -11.48 < 2e-16 ***
## sequenceI 2 0.21837 0.03055 7.15 8.86e-13 ***
## sequenceI 3 0.28354 0.02989 9.49 < 2e-16 ***
## lightNight 1.49549 0.06583 22.72 < 2e-16 ***
## treatmentUncovered (A) 0.08195 0.02220 3.69 0.000223 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Add random effects
mod3.2 <- update(mod3.1, . ~ . + (1 | time_factor/day/trial))
summary(mod3.2)
## Family: gaussian ( log )
## Formula:
## c_open_normalized ~ sequence + light + treatment + (1 | time_factor/day/trial)
## Data: roach_wide
##
## AIC BIC logLik deviance df.resid
## 3162.7 3231.2 -1571.4 3142.7 6932
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## trial:day:time_factor (Intercept) 3.340e-01 5.780e-01
## day:time_factor (Intercept) 2.882e-11 5.368e-06
## time_factor (Intercept) 9.904e-16 3.147e-08
## Residual 5.889e-02 2.427e-01
## Number of obs: 6942, groups:
## trial:day:time_factor, 6942; day:time_factor, 290; time_factor, 24
##
## Dispersion estimate for gaussian family (sigma^2): 0.0589
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.39186 0.05561 -43.01 < 2e-16 ***
## sequenceI 1 -0.65043 0.04127 -15.76 < 2e-16 ***
## sequenceI 2 0.09886 0.03416 2.89 0.003804 **
## sequenceI 3 0.12975 0.03364 3.86 0.000115 ***
## lightNight 1.38818 0.04682 29.65 < 2e-16 ***
## treatmentUncovered (A) 0.03996 0.02421 1.65 0.098784 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(ggpredict(mod3.2, terms = c("sequence","light")))+theme_JN()+theme(text=element_text(size=20))
To separately analyse the effect of habitat exclusion on overall probability of fish occupying a given habitat post-exclusion (sequence: I 3) let’s build a pair of GLMMs in glmmTMB with binomial distribution (link = logit). In these models, fixed effects are limited to experimental sequence (I 1, I 3) and used the same random effects specified in the main models.
mod_binary_ah <- glmmTMB(c_hab_normalized ~ as.factor(binary) + (1 | time_factor/day/trial), data = roach_binary, family = binomial())
summary(mod_binary_ah)
## Family: binomial ( logit )
## Formula:
## c_hab_normalized ~ as.factor(binary) + (1 | time_factor/day/trial)
## Data: roach_binary
##
## AIC BIC logLik deviance df.resid
## 1258.3 1284.1 -624.2 1248.3 1273
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## trial:day:time_factor (Intercept) 1.903e-09 4.363e-05
## day:time_factor (Intercept) 6.201e-02 2.490e-01
## time_factor (Intercept) 7.335e-28 2.708e-14
## Number of obs: 1278, groups:
## trial:day:time_factor, 1278; day:time_factor, 60; time_factor, 10
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1126 0.1031 -10.79 <2e-16 ***
## as.factor(binary)1 2.9091 0.1643 17.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model predicts a significant increase in the probability of fish occupying artificial habitat post-exclusion.
plot(ggpredict(mod_binary_ah, terms = c("binary")))+theme_JN()+theme(text=element_text(size=20))
mod_binary_ps <- glmmTMB(c_ps_normalized ~ as.factor(binary) + (1 | time_factor/day/trial), data = roach_binary, family = binomial())
summary(mod_binary_ps)
## Family: binomial ( logit )
## Formula:
## c_ps_normalized ~ as.factor(binary) + (1 | time_factor/day/trial)
## Data: roach_binary
##
## AIC BIC logLik deviance df.resid
## 1127.6 1153.3 -558.8 1117.6 1273
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## trial:day:time_factor (Intercept) 2.877e-09 5.364e-05
## day:time_factor (Intercept) 4.603e-12 2.145e-06
## time_factor (Intercept) 1.232e-20 1.110e-10
## Number of obs: 1278, groups:
## trial:day:time_factor, 1278; day:time_factor, 60; time_factor, 10
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.05317 0.08971 11.74 <2e-16 ***
## as.factor(binary)1 -3.38536 0.16652 -20.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model predicts a significant decrease in the probability of fish occupying pumping station post-exclusion.
plot(ggpredict(mod_binary_ps, terms = c("binary")))+theme_JN()+theme(text=element_text(size=20))
We will save the individual model predictions using ggpredict and then combine and restructure the dataframe using dplyr from the tidyverse.
c_hab_glmmm <-ggpredict(mod1.2, terms = c("sequence", "light"))
c_ps_glmmm <-ggpredict(mod2.2, terms = c("sequence", "light"))
c_open_glmmm <-ggpredict(mod3.2, terms = c("sequence", "light"))
c_hab_glmmm_treat <-ggpredict(mod1.2, terms = c("sequence", "treatment", "light[Day]"))
Next, a grouping variable for each habitat type is added. A second variable ‘present’ is added to indicate if data is present for the experimental sequence.
c_hab_glmmm <- c_hab_glmmm %>% mutate(habitat = "Artificial habitat", present = "T")
c_ps_glmmm <- c_ps_glmmm %>% mutate(habitat = "Pumping station", present = "T")
c_open_glmmm <- c_open_glmmm %>% mutate(habitat = "Open water", present = "T")
The dataframes are then combined and restructured for plotting. First, the rows are bound then NA values are omitted and dataframe attributes are removed (for simplicity, not necessity). Dummy rows are added for the sequence groups without data, and variables are then converted to factors with specific levels.
modeloutput_df <- bind_rows(c_hab_glmmm, c_ps_glmmm, c_open_glmmm)
modeloutput_df <- na.omit(modeloutput_df)
modeloutput_df <- as.data.frame(unclass(modeloutput_df))
modeloutput_df <- modeloutput_df %>%
add_row(x = 'Baseline', group = 'Day', habitat = 'Artificial habitat', present = "F") %>%
add_row(x = 'Baseline', group = 'Night', habitat = 'Artificial habitat', present = "F") %>%
add_row(x = 'I 2', group = 'Day', habitat = 'Pumping station', present = "F") %>%
add_row(x = 'I 2', group = 'Night', habitat = 'Pumping station', present = "F")
modeloutput_df$x <- factor(modeloutput_df$x, levels = c('Baseline', 'I 1', 'I 2', 'I 3'))
modeloutput_df$habitat <- as.factor(modeloutput_df$habitat)
modeloutput_df$habitat <- factor(modeloutput_df$habitat, levels = c('Pumping station', 'Open water', 'Artificial habitat'))
| x | predicted | std.error | conf.low | conf.high | group | habitat | present |
|---|---|---|---|---|---|---|---|
| I 1 | 0.3041765 | 0.0387274 | 0.2819427 | 0.3281637 | Day | Artificial habitat | T |
| I 1 | 0.1633742 | 0.0409847 | 0.1507639 | 0.1770393 | Night | Artificial habitat | T |
| I 2 | 0.9335923 | 0.0222946 | 0.8936759 | 0.9752916 | Day | Artificial habitat | T |
| I 2 | 0.5014354 | 0.0227331 | 0.4795839 | 0.5242825 | Night | Artificial habitat | T |
| I 3 | 0.8127345 | 0.0240701 | 0.7752828 | 0.8519954 | Day | Artificial habitat | T |
| I 3 | 0.4365223 | 0.0238856 | 0.4165575 | 0.4574439 | Night | Artificial habitat | T |
| Baseline | 0.7648180 | 0.0198081 | 0.7356943 | 0.7950946 | Day | Pumping station | T |
| Baseline | 0.5934450 | 0.0191550 | 0.5715783 | 0.6161482 | Night | Pumping station | T |
| I 1 | 0.6979559 | 0.0205494 | 0.6704035 | 0.7266407 | Day | Pumping station | T |
| I 1 | 0.5415647 | 0.0197917 | 0.5209591 | 0.5629853 | Night | Pumping station | T |
| I 3 | 0.0982743 | 0.0863840 | 0.0829678 | 0.1164046 | Day | Pumping station | T |
| I 3 | 0.0762539 | 0.0867094 | 0.0643361 | 0.0903794 | Night | Pumping station | T |
| Baseline | 0.0914594 | 0.0556065 | 0.0820155 | 0.1019907 | Day | Open water | T |
| Baseline | 0.3665273 | 0.0286108 | 0.3465395 | 0.3876679 | Night | Open water | T |
| I 1 | 0.0477254 | 0.0596329 | 0.0424610 | 0.0536426 | Day | Open water | T |
| I 1 | 0.1912617 | 0.0373060 | 0.1777760 | 0.2057704 | Night | Open water | T |
| I 2 | 0.1009631 | 0.0519537 | 0.0911884 | 0.1117855 | Day | Open water | T |
| I 2 | 0.4046138 | 0.0272225 | 0.3835914 | 0.4267883 | Night | Open water | T |
| I 3 | 0.1041301 | 0.0506727 | 0.0942852 | 0.1150029 | Day | Open water | T |
| I 3 | 0.4173058 | 0.0265197 | 0.3961693 | 0.4395699 | Night | Open water | T |
| Baseline | NA | NA | NA | NA | Day | Artificial habitat | F |
| Baseline | NA | NA | NA | NA | Night | Artificial habitat | F |
| I 2 | NA | NA | NA | NA | Day | Pumping station | F |
| I 2 | NA | NA | NA | NA | Night | Pumping station | F |
A similar process follows for the treatment dataframe.
c_hab_glmmm_treat <- as.data.frame(unclass(c_hab_glmmm_treat))
c_hab_glmmm_treat <- c_hab_glmmm_treat %>% mutate(present = "T")
c_hab_glmmm_treat <- c_hab_glmmm_treat %>% select(-facet)
c_hab_glmmm_treat <- c_hab_glmmm_treat %>%
add_row(x = 'Baseline', group = 'Covered (B)', present = "F") %>%
add_row(x = 'Baseline', group = 'Uncovered (A)', present = "F")
c_hab_glmmm_treat$x <- factor(c_hab_glmmm_treat$x, levels = c('Baseline', 'I 1', 'I 2', 'I 3'))
c_hab_glmmm_treat$group <- factor(c_hab_glmmm_treat$group, levels = c('Uncovered (A)', 'Covered (B)'))
| x | predicted | std.error | conf.low | conf.high | group | present |
|---|---|---|---|---|---|---|
| I 1 | 0.3041765 | 0.0387274 | 0.2819427 | 0.3281637 | Covered (B) | T |
| I 1 | 0.2791664 | 0.0388745 | 0.2586861 | 0.3012682 | Uncovered (A) | T |
| I 2 | 0.9335923 | 0.0222946 | 0.8936759 | 0.9752916 | Covered (B) | T |
| I 2 | 0.8568302 | 0.0228311 | 0.8193338 | 0.8960425 | Uncovered (A) | T |
| I 3 | 0.8127345 | 0.0240701 | 0.7752828 | 0.8519954 | Covered (B) | T |
| I 3 | 0.7459096 | 0.0245659 | 0.7108463 | 0.7827025 | Uncovered (A) | T |
| Baseline | NA | NA | NA | NA | Covered (B) | F |
| Baseline | NA | NA | NA | NA | Uncovered (A) | F |
The binary model predictions are then saved. The grouping variable is modified to allow for easy plotting.
ah_prob <- ggpredict(mod_binary_ah, terms = c("binary"))
ps_prob <- ggpredict(mod_binary_ps, terms = c("binary"))
ps_prob$group <- 2
ps_prob$group <- factor(ps_prob$group)
habitat_prob_df <- bind_rows(ah_prob, ps_prob)
| x | predicted | std.error | conf.low | conf.high | group |
|---|---|---|---|---|---|
| 0 | 0.2473919 | 0.1030883 | 0.2117147 | 0.2868931 | 1 |
| 1 | 0.8577281 | 0.1261066 | 0.8248246 | 0.8853107 | 1 |
| 0 | 0.7413837 | 0.0897146 | 0.7062698 | 0.7736453 | 2 |
| 1 | 0.0884921 | 0.1402805 | 0.0686808 | 0.1133226 | 2 |
The final model outputs can now be plotted. The following figures are finalized versions with some edits made after exporting from R, thus the code snippet will not produce an identical figure.
ggplot(data=modeloutput_df, aes(x=x, y=predicted, fill=group))+
geom_tile(aes(x=x, y=0.5,height = Inf, fill=present), alpha = 0.3, show.legend = FALSE) +
scale_fill_manual(values = c("T" = "grey80", "F" = "grey90"), guide = FALSE) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), size=0.5,width = 0.6 ,position = position_dodge(width = 0.7), show.legend = FALSE) +
#geom_path(aes(group = interaction(habitat, group), linetype = group), linewidth = 0.3 ,position = position_dodge(width = 0.7), show.legend = FALSE) +
scale_linetype_manual(values = c("Day" = "dashed", "Night" = "dotted"))+
geom_point(aes(shape=group),size=2, position = position_dodge(width = 0.7), show.legend = FALSE) +
scale_shape_manual(values = c("Day" = 20, "Night" = 4))+
scale_y_continuous(breaks = seq(0, 1,0.1), limits=c(0,1),expand=c(0.05,0)) +
scale_x_discrete(expand=c(0,0))+
labs(x = 'Experimental sequence',y= 'Habitat occupancy')+
theme_JN()+
theme(axis.text.x=element_text(size=8),
strip.background = element_rect(fill = "grey90"),
panel.spacing.x =unit(0, "lines") ) +
facet_grid(~ habitat, scales = "fixed")+
geom_text(data = modeloutput_df %>% filter(present == "F"), aes(x = x, y = 0.5, label = "Unavailable"), size = 8/.pt, angle = 90, fontface = "italic")
ggplot(data=c_hab_glmmm_treat, aes(x=x, y=predicted, fill=group))+
geom_tile(aes(x=x, y=0.5,height = Inf, fill=present), alpha = 0.3, show.legend = FALSE) +
scale_fill_manual(values = c("T" = NA, "F" = "grey90"), guide = FALSE) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), size=0.5,width = 0.6 ,position = position_dodge(width = 0.7), show.legend = FALSE) +
geom_point(aes(shape=group),size=1.5, position = position_dodge(width = 0.7), show.legend = FALSE) +
scale_shape_manual(values = c("Covered (B)" = 20, "Uncovered (A)" = 4))+
scale_y_continuous(breaks = seq(0, 1,0.1), limits=c(0,1),expand=c(0.05,0)) +
scale_x_discrete(expand=c(0,0))+
labs(x = 'Experimental sequence',y= 'Daytime artifical habitat occupancy')+
coord_cartesian(clip="off")+
theme_JN()+
theme(axis.text.x=element_text(size=8)) +
geom_text(data = c_hab_glmmm_treat %>% filter(present == "F"), aes(x = x, y = 0.5, label = "Unavailable"), size = 8/.pt, angle = 90, fontface = "italic")
ggplot(data=habitat_prob_df, aes(x = factor(x, labels = c("Pre-exclusion", "Post-exclusion")), y=predicted, fill=group))+
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width=.2) +
geom_line(aes(group=group,linetype = group), show.legend = FALSE)+
geom_point(aes(shape=group),size=2, show.legend = FALSE) +
scale_shape_manual(values = c("1" = 20, "2" = 4))+
scale_linetype_manual(values = c("1" = "dashed", "2" = "dotted"))+
scale_y_continuous(breaks = seq(0, 1, 0.1), limits = c(0, 1), expand = c(0.05, 0),
labels = scales::percent(seq(0, 1, 0.1), scale = 100)) +
scale_x_discrete()+
labs(x = 'Pumping station exclusion',y=expression(atop(NA, atop(textstyle('Predicted probabaility of'), textstyle('daytime habitat occupancy')))))+
coord_cartesian(clip="off")+
theme_JN()+
theme(axis.text.x=element_text(size=8))
Post-hoc analysis can now be performed to support the model outputs. Tests of relevance here include paired t.tests and repeated measures ANOVA.
#AH
anov1<- aov(c_hab_normalized ~ sequence + Error(trial), data = roach_wide%>%filter(sequence!="Baseline"))
summary(anov1)
##
## Error: trial
## Df Sum Sq Mean Sq F value Pr(>F)
## sequence 1 43.70 43.70 9.683 0.00508 **
## Residuals 22 99.29 4.51
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## sequence 2 171.6 85.78 934.6 <2e-16 ***
## Residuals 5140 471.8 0.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#AH
anov2<- aov(c_ps_normalized ~ sequence + Error(trial), data = roach_wide%>%filter(sequence!="I 2"))
summary(anov2)
##
## Error: trial
## Df Sum Sq Mean Sq F value Pr(>F)
## sequence 1 126.0 125.97 13.15 0.00149 **
## Residuals 22 210.8 9.58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## sequence 2 380.5 190.24 2446 <2e-16 ***
## Residuals 5188 403.5 0.08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#OW
anov3<- aov(c_open_normalized ~ sequence + Error(trial), data = roach_wide)
summary(anov3)
##
## Error: trial
## Df Sum Sq Mean Sq F value Pr(>F)
## sequence 1 15.24 15.239 3.013 0.0966 .
## Residuals 22 111.27 5.058
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## sequence 3 49.5 16.487 157.5 <2e-16 ***
## Residuals 6915 723.8 0.105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results of the repeated measures ANOVAs suggest thee is a significant difference in habitat occupancy across experimental sequence.
For the paired t.tests it will be necessary to create new dataframes
with equal samples between grouping variables.
First identify the group lengths.
table(roach_wide %>%
filter(light == "Day") %>%
.$sequence)
##
## Baseline I 1 I 2 I 3
## 690 648 648 630
We will use the minimal group length (630) for group comparisons.
Paired comparison between baseline and intervention 1 pumping station.
i1_base_ps<- roach_wide %>%
filter(light == "Day" & sequence %in% c("Baseline", "I 1")) %>%
group_by(sequence) %>%
mutate(row_num = row_number()) %>%
filter(row_num <= 630) %>%
ungroup() %>%
select(-row_num)
t.test(c_ps_normalized ~ sequence,data=i1_base_ps, paired = TRUE)
##
## Paired t-test
##
## data: c_ps_normalized by sequence
## t = 7.2579, df = 629, p-value = 1.164e-12
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.07747805 0.13495581
## sample estimates:
## mean difference
## 0.1062169
Paired comparison between intervention1 and intervention 2 in artificial habitat.
i1_i2_hab<- roach_wide %>%
filter(light == "Day" & sequence %in% c("I 1", "I 2")) %>%
group_by(sequence) %>%
mutate(row_num = row_number()) %>%
filter(row_num <= 630) %>%
ungroup() %>%
select(-row_num)
t.test(c_hab_normalized ~ sequence,data=i1_i2_hab, paired = TRUE)
##
## Paired t-test
##
## data: c_hab_normalized by sequence
## t = -37.156, df = 629, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.6448020 -0.5800657
## sample estimates:
## mean difference
## -0.6124339
Paired comparison between intervention 1 and intervention 3 in pumping station.
i1_i3_ps <- roach_wide %>%
filter(sequence %in% c("I 1", "I 3") & light == "Day") %>%
group_by(sequence) %>%
filter(!(sequence == "I 1" & row_number() > 630)) %>%
ungroup()
t.test(c_ps_normalized ~ sequence,data=i1_i3_ps, paired = TRUE)
##
## Paired t-test
##
## data: c_ps_normalized by sequence
## t = 42.112, df = 629, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.6350747 0.6972004
## sample estimates:
## mean difference
## 0.6661376
Paired comparison between intervention 2 and intervention 3 in artificial habitat.
i2_i3_hab<- roach_wide %>%
filter(sequence %in% c("I 2", "I 3") & light == "Day") %>%
group_by(sequence) %>%
filter(!(sequence == "I 2" & row_number() > 630)) %>%
ungroup()
t.test(c_hab_normalized ~ sequence,data=i2_i3_hab, paired = TRUE)
##
## Paired t-test
##
## data: c_hab_normalized by sequence
## t = -0.51646, df = 629, p-value = 0.6057
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.03366707 0.01964590
## sample estimates:
## mean difference
## -0.007010582
Paired comparison between intervention 1 and intervention 3 in artificial habit.
i1_i3_hab<- roach_wide %>%
filter(sequence %in% c("I 1", "I 3") & light == "Day") %>%
group_by(sequence) %>%
filter(!(sequence == "I 1" & row_number() > 630)) %>%
ungroup()
t.test(c_hab_normalized ~ sequence,data=i1_i3_hab, paired = TRUE)
##
## Paired t-test
##
## data: c_hab_normalized by sequence
## t = -38.233, df = 629, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.6512604 -0.5876285
## sample estimates:
## mean difference
## -0.6194444
Paired comparison between habitat treatments in artificial habitat.
treatment_hab<- roach_wide %>%
filter(sequence %in% c("I 1","I 2", "I 3") & light == "Day") %>%
group_by(sequence) %>%
filter(!(sequence %in% c("I 1", "I 2") & row_number() > 630)) %>%
ungroup()
t.test(c_hab_normalized ~ treatment,data=treatment_hab, paired = TRUE)
##
## Paired t-test
##
## data: c_hab_normalized by treatment
## t = 4.2772, df = 944, p-value = 2.085e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.0273454 0.0737128
## sample estimates:
## mean difference
## 0.0505291
Paired comparison between treatments in pumping station.
treatment_ps<- roach_wide %>%
filter(sequence %in% c("Baseline", "I 1", "I 3") & light == "Day") %>%
group_by(sequence) %>%
filter(!(sequence %in% c("Baseline", "I 1") & row_number() > 630)) %>%
ungroup()
t.test(c_ps_normalized ~ treatment,data=treatment_ps, paired = TRUE)
##
## Paired t-test
##
## data: c_ps_normalized by treatment
## t = -4.3869, df = 944, p-value = 1.279e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.07606882 -0.02904582
## sample estimates:
## mean difference
## -0.05255732
During the video analysis 20,827 individual fish counts were made (i.e., a count of 0 – 12 h-1 for each habitat) and therefore we used Generalized Linear Mixed Effect models (GLMMs) to account for repeated measures (trial) and temporal dependency (hour, day) and to investigate habitat occupancy and preference under baseline, habitat introduction and habitat exclusion conditions. To standardise counts and generalise interpretation, the discrete fish count data (0 – 12) from the pumping station and artificial habitats, and open water were rescaled to the continuous variable ‘habitat occupancy’ (0 – 1) by applying min-max scaling (R function ‘scale’) prior to analysis. Individual GLMMs (R function ‘glmmTMB’ in package ‘glmmTMB’) were built to estimate habitat occupancy in the three response categories (pumping station, open water, artificial habitat). The rescaled response variables were left-skewed (i.e., many values close to 0) and thus a Gamma distribution was not used. Instead, zeros were removed from the rescaled response variables by adding 1 × 10−9 and gaussian distributions were used in the models with a log link applied to account for heteroscedasticity. We defined the model structure a priori, including experimental sequence (levels = baseline, I 1, I2, I3), light (levels = day, night), and treatment (levels = unsheltered (A), sheltered (B)) as fixed effects. The baseline and intervention 2 categories were not included in the artificial habitat and pumping station models, respectively, as these were inherently not measured according to the experimental design. Repeated measures and temporal dependency were treated by including the nested random effects of hour, day, and trial, which all had variances higher than zero, improved the models Akaike’s Information Criterion (AIC) and goodness of fit (Table 1). To separately analyse the effect of habitat exclusion on overall probability of fish occupying a given habitat post-exclusion (sequence: I 3) we fitted a pair of GLMMs in glmmTMB with binomial distribution (link = logit). In these models, fixed effects were limited to experimental sequence (I 1, I 3) and used the same random effects specified in the main models. Model assessment was performed on the final models by examining the predicted versus residual diagnostic plots according to Zurr.et al. (2007) (R function ‘simulateResiduals’ in package ‘DHARma’). Residuals of the artificial habitat model were normal, but the pumping station and open water variables deviated. However, sample sizes of the response variables were sufficiently large and thus paired t tests (R function ‘t.test’) and repeated measures ANOVA (R function ‘aov’) were used for post-hoc group comparisons when there were significant fixed effect differences in the models (Fagerland, 2012). Model predictions (means ± 95% CI) were calculated using R function ‘ggpredict’ in package ‘ggeffects’. All data were analysed using R version 4.3.1 (RCore Team, 2022) in RStudio 2023.06.0 (RStudio Team, 2022) and statistical figures were created using R packages ‘ggplot2’ and ‘ggpubr’.
The raw data included a total of 30120, 29004, and 23924 fish counted across the 24 12-day trials in the artificial and pumping station habitats, and open water, respectively. After rescaling the response variables to represent habitat occupancy (0 – 1) we found that habitat occupancy was highest during the day, and night had a negative effect on habitat occupancy in the pumping station (GLMM: -0.253 ± 0.019, p = < 0.001) and artificial habitats (GLMM: -0.621 ± 0.025, p = < 0.001); open water occupancy was lowest during the day and increased at night (GLMM: 1.388 ± 0.046, p = < 0.001) (Figure 3, Table 2). The nocturnal effect i.e., fish occupying structured habitat during the day and dispersing in open water at night was fixed throughout the study, except where open water dispersal appeared to be reduced when we introduced artificial habitat (Figure 3b). No significant day-to-day variations in this relationship were observed within each experimental sequence (i.e., daytime occupancy was similar 24h vs 72h post-intervention). Our results show that habitat availability (i.e., experimental sequence) had a significant effect on habitat occupancy in the pumping station (repeated measures ANOVA: df = 2 F = 2446 p = <0.001), artificial habitat (repeated measures ANOVA: df = 2 F = 934.6 p = <0.001) and open water (repeated measures ANOVA: df = 3 F = 157.5 p = <0.001) models. Our hypothesis (i) that fish will occupy pumping station habitat during the day was supported during a baseline observation during which pumping station occupancy was predicted to reach 0.764 ± 0.019 (GLMM: p = < 0.001) (Figure 3a, Table 2). We introduced artificial habitat in a pre-exclusion period (intervention 1), but a predicted occupancy of 0.304 ± 0.038 (GLMM: p = < 0.001) in the artificial habitat only partially supported our hypothesis (ii) and suggested fish preferred the pumping station in which occupancy was marginally reduced to 0.697 ± 0.019 compared to baseline measurements (GLMM: p = < 0.001) (paired t.test: t = 4.846, p = <0.001) (Figure 3, Table 2). In support of our hypothesis (iii) artificial habitat occupancy was positively correlated with the exclusion period when we removed fish from the pumping station (intervention 2), during which daytime artificial habitat occupancy significantly increased to 0.933 ± 0.22 (GLMM: p = < 0.001) (paired t.test: t = -32.673, p = <0.001) (Figure 3c, Table 2). We reintroduced the pumping station in a post-exclusion period (intervention 3) to determine whether habitat exclusion affected the preference for pumping station demonstrated during baseline and pre-exclusion (intervention 1) observations. Our results supported our hypothesis (iv) and show that daytime habitat occupancy in the pumping station was significantly reduced to 0.098 ± 0.086 (GLMM: p = < 0.001) when compared to the pre-exclusion period (paired t.test: t = 42.112, p = <0.001) (Figure 3a, Table 2). In contrast, artificial habitat occupancy was significantly increased to 0.812 ± 0.024 (GLMM: p = < 0.001) when compared to pre-exclusion (paired t.test: t = -38.233, p = < 0.001) and only marginally reduced when compared to the exclusion period (paired t.test: t = - 0.516, p = 0.605) (Figure 3c, Table 2). In support of our hypothesis (v), we found significant effects of artificial habitat treatment where artificial habitat occupancy was negatively correlated with unsheltered (A) treatments (GLMM: -0.085 ± 0.015, p = < 0.001) and was significantly higher during trials which received sheltered treatments (paired t.test: t = 4.277, p = < 0.001) (Figure 4; Table 2). Accordingly, pumping station occupancy was positively correlated with unsheltered (A) treatments (GLMM: 0.042 ± 0.019, p = 0.026) and was significantly higher during trials which received unsheltered treatments (paired t.test: t = - 4.386, p = < 0.001) (Table 2). Open water occupancy was unaffected by treatment (GLMM: 0.039 ± 0.024, p = 0.098) (Table 2). Overall, our results show that daytime artificial habitat occupancy increased 2.7x by the end of the study, reaching 0.745 ± 0.024 and 0.812 ± 0.024 in the unsheltered and sheltered treatments, respectively. Based on our results, habitat exclusion was considered a primary driver for whether fish would preferentially select a pumping station or artificial habitat. Accordingly, we performed a pair of binomial GLMMs to examine the isolated effect of exclusion on habitat occupancy probability. These models add further support for our hypothesis (iv) and show the probability of fish occupying the pumping station was negatively correlated with the post-exclusion period (GLMM: -3.385 ± 0.166, p = < 0.001) during which occupancy probability significantly decreased from 74.1 ± 8.97 % in the pre-exclusion period to 8.84 ± 14.02 % in the post-exclusion period (Figure 5, Table 2). In contrast, we found that the probability of fish occupying artificial habitat was positively correlated with the post-exclusion period (GLMM: 2.909 ± 0.164, p = < 0.001) and occupancy probability significantly increased from 24.7 ± 10.3 % in the pre-exclusion period to 85.7 ± 12.6 % in the post-exclusion period (Figure 5, Table 2).